ABSTRACT

The Epigenomics resource at the National Center 
for Biotechnology Information (NCBI) has been 
created to serve as a comprehensive public reposi- 
tory for whole-genome epigenetic data sets (www. 
ncbi.nlm.nih.gov/epigenomics). We have con- 
structed this resource by selecting the subset of 
epigenetics-specific data from the Gene 
Expression Omnibus (GEO) database and then sub- 
jecting them to further review and annotation. 
Associated data tracks can be viewed using 
popular genome browsers or downloaded for local 
analysis. We have performed extensive user testing 
throughout the development of this resource, and 
new features and improvements are continuously 
being implemented based on the results. We have 
made substantial usability improvements to user 
interfaces, enhanced functionality, made identifica- 
tion of data tracks of interest easier and created 
new tools for preliminary data analyses. 
Additionally, we have made efforts to enhance the 
integration between the Epigenomics resource and 
other NCBI databases, including the Gene database 
and PubMed. Data holdings have also increased 
dramatically since the initial publication describing 
the NCBI Epigenomics resource and currently 
consist of >3700 viewable and downloadable data 
tracks from 955 biological sources encompassing 
five well-studied species. This updated manuscript 
highlights these changes and improvements. 

INTRODUCTION 

Epigenetics is attracting growing interest in the
scientific community. The term refers to the study of
stable, often heritable, changes in gene expression that
are not mediated by changes in DNA sequence (1,2).
Epigenetic mechanisms play crucial roles in regulating
chromatin state, thereby influencing processes such as
gene expression, DNA repair and recombination.



Although individual epigenetic features are tied to 
specific genomic locations and can be stably inherited 
through many rounds of cell division, these epigenetic 
features can be modified or erased in response to
developmental cues or external and environmental stimuli
(3-6). Just as these epigenetic mechanisms strongly influ- 
ence development and cellular processes, defects in these 
mechanisms can prove to be quite deleterious. It is now 
known that certain defects in epigenetic regulation can be 
linked to instances of human disease, including develop- 
mental defects, metabolic disorders and cancer (7-9). 
Additionally, links are being uncovered between the 
epigenome and more common complex diseases including 
psychosis, diabetes and asthma. A better understanding of
the epigenomic factors that contribute to these disease
processes will lead to additional strategies for treatment
in the future, and this prospect has become a major
driving force behind epigenomics research (10-13).

Epigenetic modifications are varied and diverse yet fall 
into four major classes: post-translational modification of 
histone proteins, chromatin conformation/accessibility, 
DNA modification and non-coding regulatory RNA 
(3,14). These mechanisms have been intensely studied 
and are well characterized. Covalent modification of 
histone proteins can induce or relax the packaging
constraints of chromatin (15). Modification of DNA,
specifically methylation of cytosine, is crucial for
processes such as genomic imprinting, X-chromosome
inactivation and long-range silencing of genomic regions
(3,16). Other modified
forms of cytosine have more recently been discovered that 
show distinct genomic localizations and functions. These 
include 5-hydroxymethylcytosine, 5-formylcytosine and 
5-carboxylcytosine (17). Non-coding RNA molecules can 
interact with specific target mRNAs and trigger a cascade 
of events resulting in specific mRNA degradation (18-20). 
Chromatin accessibility and nucleosome positioning have 
also been determined to serve as epigenetic mechanisms. It 
is not uncommon to find elements that regulate gene 
expression (e.g. promoters, enhancers, insulators) in 
regions of the genome that are maintained as 'open' or 
accessible. These accessible regions can serve as binding 
sites for chromatin-modifying enzymes and other protein 
factors (21). These epigenetic mechanisms often act 
in concert for more complex levels of regulation. For
example, it has been observed that small non-coding
RNA molecules can participate in directing DNA
methylation, and that enhancer elements, often found in
regions where chromatin is maintained as accessible, can
themselves encode small non-coding RNA molecules
(18-20,22).

The distribution of these epigenetic features throughout 
the genome constitutes what can be considered the cellular 
'epigenome'. The epigenome, unlike the genome itself, is 
dynamic, and localization of these epigenetic features is 
influenced by cell or tissue type, age, exposure to environ- 
mental stimuli or countless other factors. These factors 
make defining an organism's singular epigenome a 
daunting, if not impossible, task. Yet, owing to the 
complex and important roles that epigenetic phenomena 
play in human health and development, efforts to
understand the human epigenome are underway. To this end,
the NIH launched the Roadmap Epigenomics
Project in 2007 (23). One of the goals of this project is
to combine whole-genome epigenetic analysis with
high-throughput sequencing to create a series of publicly
available reference epigenome maps. These maps will
encompass a wide array of cell lines, cell types and tissue
types from individuals at various developmental stages and
health states. Other large-scale efforts to map epigenetic
features and gene regulatory elements are currently
ongoing. These include the ENCODE (ENCyclopedia of
DNA Elements) project, Mouse ENCODE and the
modENCODE project for model organisms (24-27). The
National Center for Biotechnology Information (NCBI)
Epigenomics database was created as a repository for
these data as well as for data from independent
laboratories not
involved in these project initiatives (28). In this article,
we will describe new features, improvements and current 
holdings in this growing resource. 



THE EPIGENOMICS DATABASE 

The organizational framework for the database involves 
studies, samples, experiments and genome tracks. At the 
lowest level, a genome track is a representation of a signal
or annotation in the coordinate system of a specific
genome assembly. At present, all of the tracks in the 
database are molecular abundance graphs (e.g., enrich- 
ment of a modified histone). However, the database is 
designed to capture additional types of tracks (e.g., peak 
calls or chromatin state maps) once they become more 
widely available. An experiment refers to the laboratory 
assay that generated the raw data that were used to con- 
struct a track. It is possible for there to be multiple tracks 
for one experiment, either because several types of tracks 
were made or because there are variants for different 
assemblies of the genome. A sample refers to the biological 
material that was used for the experiment, which is an 
essential unit for allowing all of the results from one 
isolate to be grouped together. Finally, a study is a 
group of experiments with a common set of scientific aims. 
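
One way to picture this hierarchy is as a set of nested record types. The sketch below is illustrative only: the class names, field names and accession strings are assumptions chosen for readability and do not reflect the actual database schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Track:
    """A signal or annotation on one genome assembly (hypothetical fields)."""
    accession: str
    assembly: str
    track_type: str = "abundance_graph"


@dataclass
class Sample:
    """The biological material used for an experiment."""
    sample_id: str
    description: str


@dataclass
class Experiment:
    """The assay that generated the raw data behind one or more tracks."""
    assay: str
    sample_id: str
    tracks: List[Track] = field(default_factory=list)


@dataclass
class Study:
    """A group of experiments with a common set of scientific aims."""
    title: str
    experiments: List[Experiment] = field(default_factory=list)

    def tracks_for_sample(self, sample_id: str) -> List[Track]:
        """Collect every track derived from one biological sample."""
        return [track
                for experiment in self.experiments
                if experiment.sample_id == sample_id
                for track in experiment.tracks]
```

Keeping the sample identifier on each experiment is what allows all results from one isolate to be grouped together, for example with `study.tracks_for_sample("S1")`.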

In building the Epigenomics resource, data are selected
from the Gene Expression Omnibus (GEO) database and
subjected to additional processing and tracking. GEO is
a database of large-scale molecular abundance data 
generated for functional genomics studies (29). Because 
most of the experiments in scope for epigenomics are 
based on sequencing methodologies, there are often com- 
panion submissions to the Sequence Read Archive (SRA) 
database (30). Genome tracks may be attached to GEO 
submissions as supplementary files in a wide variety of 
formats. For molecular abundance graphs, we currently 
accept WIG and bigWig files. These data are subjected to 
computational analysis to identify the likely genome 
assembly and to ensure that all genome coordinates are 
in a valid range. Tracks are given accession numbers, and 
changes are tracked using revision numbers and update 
dates. In some cases, submitted tracks that had been con- 
structed using an older genome assembly will be remapped 
to reflect the current state of the genome. In this case, the 
derived track has a separate accession number and 
revision chain to keep it distinct from the original submis- 
sion. Incidentally, for records in Epigenomics, links are 
provided to original submitted records and data at GEO 
and SRA. These raw unprocessed data can be accessed 
and downloaded by users who are interested in performing 
their own analyses. 
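
The coordinate check described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the analysis that is actually run on submissions: the function names, the interval layout and the rule of choosing the assembly with the fewest invalid intervals are all hypothetical, and parsing of WIG or bigWig files is omitted.

```python
from typing import Dict, Iterable, List, Tuple

# (chromosome, start, end), taken here to be 0-based and half-open.
Interval = Tuple[str, int, int]


def out_of_range(intervals: Iterable[Interval],
                 chrom_sizes: Dict[str, int]) -> List[Interval]:
    """Return the intervals that do not fit inside the given assembly."""
    bad = []
    for chrom, start, end in intervals:
        size = chrom_sizes.get(chrom)
        # Unknown chromosome, negative start, empty span or overrun.
        if size is None or start < 0 or start >= end or end > size:
            bad.append((chrom, start, end))
    return bad


def likely_assembly(intervals: Iterable[Interval],
                    assemblies: Dict[str, Dict[str, int]]) -> str:
    """Guess the assembly as the candidate with the fewest invalid intervals."""
    data = list(intervals)
    return min(assemblies,
               key=lambda name: len(out_of_range(data, assemblies[name])))
```

With toy chromosome lengths such as `{"asmA": {"chr1": 1000}, "asmB": {"chr1": 500}}`, a track containing an interval that ends at position 700 is only consistent with `asmA`.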

Layered on top of the basic track data are metadata in 
the form of controlled key terms and relationships with 
records in other databases. Many of these attributes 
pertain to properties of the biological material, such as 
cell type, differentiation state, health status and so forth. 
Others are properties of the experiment, such as assay type 
and (where relevant) specific antibodies. Relationships 
between experiment records and the original data in 
GEO and SRA are captured. Some of the larger studies 
may have links to a new NCBI database called BioProject, 
which describes the project aims and provides links to 
associated data (31). Finally, nearly all studies will ultim- 
ately have one or more publications, which are captured as 
links to PubMed citations and (where applicable) full-text 
articles in PubMed Central (PMC). 

The database holdings have grown dramatically over the 
past few years. As shown in Figure 1, the total number of 
database records has increased >3-fold since our previous 
report 2 years ago (values are from September of the 
indicated year). As of this writing, the database contains 
3708 genome tracks sourced from 955 biological samples. 
These data come from five well-studied species (Figure 2A). 
Given the large output from the Roadmap Epigenomics 
and ENCODE projects, it is not surprising that the bulk 
of the records (73%) are of human origin. There is also a 
significant amount of data from mouse (20%), but smaller 
amounts from the other model organisms. Data tracks 
reflect a variety of assay types (Figure 2B), dominated by 
histone modifications (47% of the total), but also including 
DNA methylation, chromatin accessibility and various 
chromatin-associated factors (including RNA polymerase, 
transcription factors and various histone-modifying 
enzymes). Although not epigenetic per se, gene expres- 
sion — either at the level of mRNA or small non-coding 
RNAs — is often assayed along with epigenetic marks in 
order to advance understanding of gene regulatory 
networks. Finally, various sorts of controls are often 
included in submissions (input control, antibody control),
and they are included in the resource to allow for use in 
normalization or quality assessment. 



THE EPIGENOMICS WEBSITE 

All content of the database is made publicly available
through the Epigenomics resource on the NCBI website
(www.ncbi.nlm.nih.gov/epigenomics/). In the course of
the past 2 years, we have made incremental improvements
to the resource based on feedback from the user community,
together with our own usability testing and web log
analysis. In addition, the site has benefited from general
enhancements of the NCBI search interface, including
faceted searching, spelling correction and synonymy
mapping (e.g. mapping 'ESC' to 'embryonic stem cell').

Figure 1. Track data in the NCBI Epigenomics resource over time.
Holdings have increased by >3-fold in the time period spanning from
2010 to 2012. Currently there are 3708 tracks available.

An example of an improvement that came from user 
feedback is the extension of the Sample Browser interface 
to work with experiment records. This tool lists database 
records in a tabular (spreadsheet-like) format, with 
columns corresponding to various biological and
experimental attributes. Now, users can easily switch back
and forth between browsing these two classes of records
while retaining the familiar features: filtering the table
content, sorting on any of the columns and exporting
the information in a spreadsheet-compatible format.
Additionally, user feedback has shaped the specifics of 
the filtering options and choices of which columns to 
show by default (although users are free to change these 
defaults). The Sample Browser also serves as a starting 
point for other tools and features of the site, such as 
saving records of interest in user-defined collections, 
bulk data downloading and graphical visualization using 
the NCBI genome viewer [with a link to the corresponding 
view on the University of California, Santa Cruz (UCSC) 
Genome Browser]. 

Figure 2. Composition of track holdings in the NCBI Epigenomics resource. (A) Percentage of holdings by species. Species include Homo sapiens
(H. sapiens), Mus musculus (M. musculus), Arabidopsis thaliana (A. thaliana), Drosophila melanogaster (D. melanogaster) and Caenorhabditis elegans
(C. elegans). (B) Percentage of holdings by assay type. Assay types include histone modifications, DNA methylation, chromatin accessibility, various
chromatin-associated factors (including RNA polymerase, transcription factors and various histone-modifying enzymes) and small RNA and gene
expression.

Another line of enhancements was driven by web
analytics, starting with the observation that a large fraction of
the free-text searches performed on the site did not yield
any results. From inspection of a few hundred such
queries, we concluded that about half of them were gene
symbols or other words aimed at finding specific genes or
gene families. Because all of the records in the database
represent some form of whole-genome analysis, they are
not indexed by gene symbol. To address this problem, we
developed a search enhancement that would check for
gene symbols in the query text and, when found,
generate a 'featured result' with a direct link to a graphical
genome view centered on the gene of interest. In these
views, a selected set of reference tracks is displayed, 
leaving it to the user to substitute other tracks of 
interest. Follow-up analysis of the logs showed that 
~22% of all site visitors made use of the feature at some 
time during their session and ~30% of all empty results 
were 'rescued' by directing the user to some useful content. 
Clearly, genes are useful starting points for researchers, so 
we worked with the staff of the NCBI Gene database to 
add links to the Epigenomics genome views on records for 
human genes. When the updated views were deployed, 
there was a concomitant doubling of overall usage of the 
Epigenomics resource and a 4-fold increase in the number 
of genome views. To make the genome views more useful, 
we have added annotation tracks for CpG islands and 
clinically relevant sequence variations. 
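
The gene-symbol check might look something like the sketch below. It is an assumption-laden illustration rather than the production search code: the lookup table, its placeholder symbols and coordinates, the `flank` parameter and the shape of the returned record are all hypothetical.

```python
from typing import Dict, Optional, Tuple

# Hypothetical lookup table: symbol -> (chromosome, start, end).
# Symbols and coordinates are placeholders, not real annotations.
GENE_INDEX: Dict[str, Tuple[str, int, int]] = {
    "GENE1": ("chr1", 10_000, 12_000),
    "GENE2": ("chr2", 50_000, 58_000),
}


def featured_result(query: str, flank: int = 1_000) -> Optional[dict]:
    """Return a genome-view region for the first gene symbol in the query.

    Returns None when no token in the query matches a known symbol, in
    which case the ordinary (possibly empty) search results would be shown.
    """
    for token in query.replace(",", " ").split():
        symbol = token.upper()
        hit = GENE_INDEX.get(symbol)
        if hit is not None:
            chrom, start, end = hit
            # Pad the gene with flanking sequence so the view is centered on it.
            return {"symbol": symbol, "chrom": chrom,
                    "start": max(0, start - flank), "end": end + flank}
    return None
```

A query such as `"gene1 methylation"` would then produce a featured region on `chr1`, whereas a query with no recognizable symbol falls through to the regular search.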

We have developed a comparison tool that is aimed at 
identifying genes that have undergone some sort of change 
between one biological state and another — say, before and 
after differentiation. The tool works by identifying clus- 
tered genomic positions that show the most difference in 
one track compared with another and then connecting 
them to genes based on proximity. A short list of 
the genes with the most differences is generated, and the 
result includes small thumbnail graphics depicting the 
signal at that locus for each of the two samples. To give 
the user a general flavour for the kinds of genes involved, 
functional terms that occur frequently in the top genes are 
listed [terms come from the Gene Ontology (GO) or
names of pathways in the NCBI BioSystems database]. 
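
The general idea behind such a comparison can be sketched in a few lines. This is a deliberately simplified stand-in and not the tool's actual algorithm: it assumes both tracks are already binned on a single chromosome, it ranks individual bins rather than clusters of positions, and the function name, parameters and gene positions are hypothetical.

```python
from typing import Dict, List, Tuple


def top_changed_genes(track_a: List[float], track_b: List[float],
                      bin_size: int, genes: Dict[str, int],
                      n_bins: int = 2) -> List[Tuple[str, float]]:
    """Rank genes by the largest signal difference in their nearest bins.

    track_a, track_b: binned signal values covering the same region.
    genes: gene name -> representative position on that region.
    """
    diffs = [abs(a - b) for a, b in zip(track_a, track_b)]
    # Indices of the bins where the two tracks differ the most.
    top = sorted(range(len(diffs)), key=lambda i: diffs[i],
                 reverse=True)[:n_bins]
    best: Dict[str, float] = {}
    for i in top:
        centre = i * bin_size + bin_size // 2
        # Connect the bin to a gene based on proximity.
        gene = min(genes, key=lambda g: abs(genes[g] - centre))
        best[gene] = max(best.get(gene, 0.0), diffs[i])
    return sorted(best.items(), key=lambda item: item[1], reverse=True)
```

The returned short list of gene names could then be used to look up functional terms or to draw thumbnail views of the signal at each locus.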

A recently developed feature provides users with the 
ability to upload custom data tracks. To maintain 
privacy and to support long-term data storage, it is neces- 
sary for users to have an account and be signed in before 
uploading tracks. Accounts are free and may be estab- 
lished in the My NCBI part of the site (be aware that 
quotas on number of uploaded tracks per account may 
be adjusted over time depending on demand). This 
upload functionality supports popular file formats 
including BED and WIG. Data can be uploaded from a 
local file or a public URL. Along with the data track itself, 
a small number of metadata attributes may be entered if 
desired. On completion of the uploading process, custom 
tracks are listed in the Experiment Browser interface and 
may be viewed graphically alongside public tracks. 



CONCLUSION 

The Epigenomics database at NCBI was established to 
serve as a public resource for epigenomic data sets. Data 
have been collected from both large-scale studies such as
the NIH Roadmap Epigenomics project, ENCODE and 
modENCODE and from smaller single laboratory studies. 
The holdings in Epigenomics have increased more than 
3-fold over the previous 2 years, and continue to grow. 
We have implemented new tools and new features in
response to user feedback, including the ability to browse
the database at the experiment level and to compare and
upload data tracks. We anticipate a continuing growth
of data holdings, development of new tools and improve- 
ment of general usability in the future. It is our goal to 
continue providing a comprehensive public resource for 
epigenomic data sets that gives users with varying
degrees of knowledge in the field the ability to analyse
and explore these data.